Statistical Machine Translation: Robust parameter estimation from noisy corpus

Author

  • Shashi Mittal
Abstract

In this report, we describe our study of the effect of noise on parameter estimation for statistical machine translation. So far, no study has been done on this topic, even though the EM algorithm, which is used for parameter estimation in statistical machine translation, is known to be highly sensitive to noise. We present in detail the experiments performed to observe the influence of noise on parameter estimation, and the various methods investigated to counter this effect.

Similar resources

Two Ways to Use a Noisy Parallel News Corpus for Improving Statistical Machine Translation

In this paper, we present two methods to use a noisy parallel news corpus to improve statistical machine translation (SMT) systems. Taking full advantage of the characteristics of our corpus and of existing resources, we use a bootstrapping strategy, whereby an existing SMT engine is used both to detect parallel sentences in comparable data and to provide an adaptation corpus for translation mo...

Discriminative Corpus Weight Estimation for Machine Translation

Current statistical machine translation (SMT) systems are trained on sentence-aligned and word-aligned parallel text collected from various sources. Translation model parameters are estimated from the word alignments, and the quality of the translations on a given test set depends on the parameter estimates. There are at least two factors affecting the parameter estimation: domain match and trai...

Using Noisy Bilingual Data for Statistical Machine Translation

SMT systems rely on a sufficient amount of parallel corpora to train the translation model. This paper investigates the possibility of using word-to-word and phrase-to-phrase translations extracted not only from clean parallel corpora but also from noisy comparable corpora. Translation results for a Chinese-to-English translation task are given.

Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora

The parameters of statistical translation models are typically estimated from sentence-aligned parallel corpora. We show that significant improvements in the alignment and translation quality of such models can be achieved by additionally including word-aligned data during training. Incorporating word-level alignments into the parameter estimation of the IBM models reduces alignment error rate an...

Robust Estimation of Feature Weights in Statistical Machine Translation

Weights of the various components in a standard Statistical Machine Translation model are usually estimated via Minimum Error Rate Training. With this, one finds their optimum value on a development set with the expectation that these optimal weights generalise well to other test sets. However, this is not always the case when domains differ. This work uses a perceptron algorithm to learn more ...

Publication date: 2005